Copy, please, these files and directories to your personal directory:

cp -r /data/shared/AGE2020/Exercises/E02-intro_to_advanced_R ~/AGE2020/Exercises

And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E02-intro_to_advanced_R")


Introduction to tidyverse

Until now you have been probably using only the base R. Tidyverse can be seen as a dialect of R and is described as follows:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Tidyverse is very popular these days, but as some (here and here) have noted, it is not suitable for R beginners.

Useful links

Tidy data

Tidyverse packages are not meant to replace the base R. Especially, they are not suited for matrix manipulation, and are specifically used for long data (also called tidy data). More about tidy data can be found here.

Following three rules makes a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.

As you will see, we will be mostly working with biological matrices (= data in wide format). For example, consider this datafame:

(df_wide <- data.frame(sample_1 = 1:2, sample_2 = 3:4, sample_3 = 5:6, row.names = c("gene_A", "gene_B")))

For each sample (column) and each gene (row) we have its value (cell). This form of data is usable for functions operating on matrices (e.g. PCA), but as you will see later, not for e.g. plotting (besides base R plotting).

This dataframe is wide, and thus it is not tidy. You can see that rownames and colnames are, in fact, values of variables Gene and Sample, respectively. So how would this dataframe look like in tidy format? Don’t be surprised now, we will get to this code later.

library(magrittr)

df_long <- df_wide %>%
  tibble::rownames_to_column("Gene") %>%
  tidyr::pivot_longer(-Gene, names_to = "Sample", values_to = "Count")
df_long

Now df_long corresponds to tidy format of the original dataframe. We actually did this:

Converting wide data to tidy format.

Why is tidy data useful for some purposes (mainly for plotting)? Well, for biological matrix like the one in df, you usually also have a sample sheet, for example:

(sample_sheet <- data.frame(Sample = c("sample_1", "sample_2", "sample_3"), Age = 1:3, Treatment = c("yes", "no", "yes")))

Would it be neat to have this information alongside the tidy data (df_long)? I think so:

dplyr::left_join(df_long, sample_sheet, by = "Sample")

As you will see later, this tidy form of data is extremely useful for plotting. In this blog post, there is a nice example of using tidy data and tidyverse in genomics.

Non-standard evaluation (NSE)

Let’s go ahead a bit and look at this select() function from dplyr package (will be introduced later):

dplyr::select(mtcars, cyl, mpg) %>% head()

You can see that select() is filtering columns of the dataframe. But it has a strange syntax, hasn’t it? We have used variables, which are not defined. Don’t be scared by this dark magic - it is based on R’s lazy evaluation (details) and it’s called non-standard evaluation (NSE). In short, R only evaluates the expression if the expression is actually used.

Tidyverse functions use NSE to look first at variables inside a passed object (columns = variables). Most of these functions are designed to have their first parameter as data object. When variables are not found inside the object, function will look for variables in the global environment (this can cause unwanted problems, when object has variable with the same name as the one in global environment). So in the example above, cyl and mpg variables are evaluated in the context of mtcars object and environment defined in select() function call.

NSE is a quite advanced topic and its knowledge is mainly needed for implementation of custom functions, which share the tidyverse philosophy. See vignette("programming").

Now we will quickly look at the most popular tidyverse packages.

magrittr - pipe operator

The pipe operator is probably the basic compound of shell programming. Fortunately for us, it is also implemented in R.

The pipe operator (see ?`%>%`) is forwarding output from left hand side (LHS) to input on right hand side (RHS). The default behaviour is to pipe output to the first argument of a RHS function.

So instead of using nested functions like

head(rownames(mtcars))
## [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
## [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

you can use the pipe operator:

rownames(mtcars) %>% head()
## [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
## [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

And there are even more types of pipes.

Piped object is available on RHS as the . object. It is used when you want:

  • To use an argument different than first of the RHS function:
c("first", "second", "third") %>% paste("I am", ., sep = " ")
## [1] "I am first"  "I am second" "I am third"
  • To use the LHS output as a regular R object:
mtcars %>% .$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars %>% .[1:5, 1:3]
  • To use a lambda expression. They behave like a function with single parameter .:
mtcars %>% {head(.$cyl)}
## [1] 6 6 4 6 8 6
mtcars %>% {
  my_data <- .
  c(min(my_data), max(my_data))
}
## [1]   0 472

Or you can define an anonymous function:

mtcars %>% (function(my_data) c(min(my_data), max(my_data)))
## [1]   0 472

Using arithmetic and other operators together with pipe can be more pleasant with provided aliases.

Add and subtract operators are working fine:

1 %>% + 2 %>% - 3
## [1] 0

But for the sake of consistency, rather use operator aliases:

1 %>% add(2) %>% subtract(3)
## [1] 0

However, how would you use, for example, == or * operator?

1 %>% == 2
1 %>% * 3
## Error: <text>:1:7: unexpected '=='
## 1: 1 %>% ==
##           ^

You have to use aliases:

1 %>% equals(2)
## [1] FALSE
1 %>% multiply_by(3)
## [1] 3

In fact, all aliases are actually shorthands for writing operators as “classic” functions, e.g. `*`(x, y). So the examples above can be also written as:

1 %>% `==`(2)
## [1] FALSE
1 %>% `*`(3)
## [1] 3

tibble - enhanced data.frame

There are some differences to data.frame (see link above and examples here), but the most visible difference is in printing of a tibble:

tibble::as_tibble(mtcars)

tibble is showing the data types of variables. Also you can see that there aren’t rownames - this is strange, but it is one of the features of tibble. Authors of tibble package are convinced that rownames should be represented by variable (column) and not by object’s attribute, accessible by rownames() method.

This is bad particularly for usage of tidyverse together with Bioconductor, where rownames are an essential component. Many tidyverse functions return a tibble or drop rownames, but most of times we rather want data.frame with defined rownames. tibble can be converted to data.frame by as.data.frame() function.

Useful is the tribble() function for more readable creation of (small) tables in code:

tibble::tribble(
  ~column_A, ~column_B,
  "a",   1,
  "b",   2,
  "c",   3
)

To preserve rownames in a new column, use the tibble::rownames_to_column("name_of_new_column") function.

dplyr - data manipulation

dplyr is probably the most popular package from tidyverse. It helps to solve the most common data manipulation challenges and its sibling dbplyr also supports remote data stored in a database (MySQL, PostreSQL, SQLite, and others).

Some functions in dplyr have names similar to functions from other packages. Thus, it is recommended to always specify the dplyr:: namespace.

select() - select (filter) columns

library(dplyr)
mtcars_sm <- head(mtcars)
dplyr::select(mtcars_sm, cyl, mpg)

By using - you can also specifify columns to omit:

dplyr::select(mtcars_sm, -cyl, -mpg)

We can select a range of variables. Numerical range is also supported, e.g. 1:5. Omitting a range is also possible - in thix example you would use -(disp:drat).

dplyr::select(mtcars_sm, disp:drat)

There are also very useful select helpers. For omitting they can be also combined with -.

Select columns starting with “c”:

dplyr::select(mtcars_sm, starts_with("c"))

Select columns containing “ar”:

dplyr::select(mtcars_sm, contains("ar"))

Select column at last - 1 position:

dplyr::select(mtcars_sm, last_col(1))

Select columns having names in vector and don’t throw error on non-existing column:

dplyr::select(mtcars_sm, one_of(c("mpg", "cyl", "garbage")))

Because this will fail:

dplyr::select(mtcars_sm, mpg, cyl, garbage)
## Error in .f(.x[[i]], ...): object 'garbage' not found

select() can also select and rename columns at the same time:

dplyr::select(mtcars_sm, cylinders = cyl, horse_power = hp)

To only rename specific and keep other columns, use rename() function:

dplyr::rename(mtcars_sm, cylinders = cyl, horse_power = hp)

Using the everything() select helper also works, but you will lose the original column order:

dplyr::select(mtcars_sm, cylinders = cyl, horse_power = hp, everything())

select() always returns an object of the same type as input. However, sometimes you rather need a vector: pull() returns a vector of specified variable.

dplyr::pull(mtcars_sm, cyl)
## [1] 6 6 4 6 8 6

arrange() - order rows

arrange() orders rows by specified variables.

dplyr::arrange(mtcars_sm, cyl, disp)

Use desc() or - for descending order:

dplyr::arrange(mtcars_sm, desc(cyl))
# or
# dplyr::arrange(mtcars_sm, -cyl)

filter() - filter rows

In fact, filter() is very similar to base R function subset(), but offers much more functionality.

Multiple arguments are equivalent to AND operator:

dplyr::filter(mtcars_sm, hp > 100, cyl == 6)

So this is equivalent:

dplyr::filter(mtcars_sm, hp > 100 & cyl == 6)

Thanks to NSE we can do some magic. Here, we want rows where hp is greater than mean value of hp minus 10:

dplyr::filter(mtcars_sm, hp > mean(hp) - 10)

mutate() - add new or modify existing variables

Named parameters of mutate() function create new variables (or modify existing, if a parameter name is same).

tibble::rownames_to_column(mtcars_sm, "car_name") %>%
  dplyr::mutate(cyl2 = cyl * 2) %>%
  dplyr::select(car_name, cyl, cyl2)

mutate() is one of the tidyverse functions, which silently drop rownames. To preserve them, we have moved rownames to the new column car_name.

This is how we remove the column and modify the existing one:

dplyr::mutate(mtcars_sm, mpg = NULL, cyl = cyl * 2)

Very useful are these “if” functions:

if_else() checks for a condition and returns values for TRUE and FALSE matches. In missing parameter you can also specify a return value for NA values. if_else() is similar to base R ifelse(), but checks for the same return types.

dplyr::mutate(mtcars_sm, car_category = if_else(hp >= 100, "strong", "weak")) %>%
  dplyr::select(hp, car_category)

You can see that return values must be the same type:

dplyr::mutate(mtcars_sm, car_category = if_else(hp >= 100, "strong", 50)) %>%
  dplyr::select(hp, car_category)
## `false` must be a character vector, not a double vector

recode() replaces values, similar to switch(). You can specify default return value for not-specified and NA values (.default and .missing parameters). Very useful for replacing factor values while preserving their order.

dplyr::mutate(mtcars_sm, am_char = recode(am, `0` = "zero", `1` = "one")) %>%
  dplyr::select(am, am_char)

recode_factor() create factors with levels ordered as they appear in the recode call.

factor(mtcars_sm$am) %>%
  levels()
## [1] "0" "1"
# You can notice that mutate() parameters are "executed" from first to last.
# So to "am_char" is already passed "am" as factor, as "am" is mutated before "am_char".
dplyr::mutate(
  mtcars_sm,
  am = factor(am),
  am_char = recode_factor(am, `1` = "one", `0` = "zero")
) %>%
  dplyr::pull(am_char) %>%
  levels()
## [1] "one"  "zero"

case_when() allows to specify multiple if statements. Statements are given in a format condition ~ return_value and they are matched from first to last.

For else statement put TRUE ~ some_value as the last statement (if not specified, some_value will be NA). As case_when() is checking for return types, you have to use appropriate NA type: NA_integer_, NA_real_, NA_complex_ or NA_character_. Be careful since 1 is a real number (numeric type), whereas 1L is a true integer.

dplyr::mutate(
  mtcars,
  car_category = case_when(
    hp < 80 ~ "weak",
    hp >= 80 & hp <= 100 ~ "middle",
    hp > 100 ~ "strong"
  )
) %>%
  dplyr::select(hp, car_category)

All these “if” functions are vectorised and can be also used for vectors:

v <- c(1, 1, 2, 3, 10)
if_else(v < 2, "< 2", ">= 2")
## [1] "< 2"  "< 2"  ">= 2" ">= 2" ">= 2"

group_by() and summarise()

group_by() is similar to SQL operation of the same name. It takes an existing dataframe and converts it into a grouped dataframe where operations are performed “by group”.

(by_cyl <- dplyr::group_by(mtcars, cyl))

Better is to see the original tibble in your console, where is information about groups.

Now we can perform operations on groups. summarise() reduces multiple values down to a single value. Let’s calculate some statistics in the cyl groups:

dplyr::summarise(
  by_cyl,
  mpg_mean = mean(mpg),
  hp_max = max(hp),
  qsec_median = median(qsec),
  wt_sd = sd(wt)
)

n() is a special function to get counts inside groups:

dplyr::summarise(
  by_cyl,
  n = n()
)

filter() is now acting on groups. Let’s compare these two filter operations:

dplyr::filter(mtcars, mpg > mean(mpg))
dplyr::filter(by_cyl, mpg > mean(mpg))

The former keeps rows with mpg greater than the global average whereas the latter keeps rows with mpg greater than the average inside the cyl groups.

To remove grouping, use the ungroup() function.

*_join() and bind_rows() - joining dataframes

A collection of *_join() functions acts in the same way as in SQL - based on common keys, they join two dataframes together.

mtcars_prices <- tibble::tribble(
  ~cyl, ~price,
  4,    1000,
  6,    5000,
  8,    10000
)

dplyr::left_join(mtcars_sm, mtcars_prices, by = "cyl") %>%
  dplyr::select(cyl, price)

bind_rows() append rows to dataframe. It puts NAs in columns not present in the second dataframe.

dplyr::bind_rows(mtcars_sm, mtcars_prices) %>%
  dplyr::select(cyl, price, everything())

tidyr - tools for tidy data

The goal of tidyr is to help you create tidy data. Just to remind you, tidy data is data where:

  • Every column is variable.
  • Every row is an observation.
  • Every cell is a single value.

Cheatsheet on reading and tidying data.

Let’s look at some of basic tasks tidyr can do:

Pivotting

Pivotting converts between long and wide forms. We will reuse the wide df_wide dataframe.

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

df_wide
df_long <- df_wide %>%
  tibble::rownames_to_column("Gene") %>%
  tidyr::pivot_longer(-Gene, names_to = "Sample", values_to = "Count")
df_long

We have added rownames of df_wide to the new column Gene. In pivot_longer() we specified by -Gene we want to lengthen this column.

Of course more than one column can be lengthened. Let’s add more columns to df_wide:

df_wide2 <- df_wide %>%
  tibble::rownames_to_column("Gene_symbol") %>%
  dplyr::mutate(
    Gene_ID = c("ENSG0001", "ENSG0002"),
    Gene_class = c("protein_coding", "miRNA")
  )

df_wide2
df_long2 <- df_wide2 %>%
  tidyr::pivot_longer(-c(Gene_symbol, Gene_ID, Gene_class), names_to = "Sample", values_to = "Count")

df_long2

We can see gene columns have the same prefix, and so we can utilize that: select helpers are allowed here.

df_wide2 %>%
  tidyr::pivot_longer(-starts_with("Gene_"), names_to = "Sample", values_to = "Count")

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

tidyr::pivot_wider(df_long, names_from = "Sample", values_from = "Count")
tidyr::pivot_wider(df_long2, names_from = "Sample", values_from = "Count")

Splitting and combining character columns

separate() turns a single character column into multiple columns.

(df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")))
tidyr::separate(df, x, c("A", "B"), sep = "\\.")

extract() turns each capturing group into a new column.

(df <- data.frame(x = c(NA, "a-1-x", "a-2-y", "a-3")))
tidyr::extract(df, x, c("A", "B", "C"), regex = "([a-z])-(\\d)-?([a-z])?")

unite() pastes together multiple columns into one.

(df <- data.frame(x = c(NA, "a", "b", "c"), y = c("a", "f", "g", "h")))
tidyr::unite(df, "x_y", x, y, sep = "_", remove = FALSE)

Other tidyr functions

drop_na() drops rows containing missing values.

(df <- data.frame(x = c(1, NA, 2), y = c(1, 2, NA)))
tidyr::drop_na(df)
tidyr::drop_na(df, x)

replace_na() replaces missing values.

For dataframes, specify a list of columns and replacements.

tidyr::replace_na(df, list(x = 5, y = 10))

For vectors, specify a replacement value.

tidyr::replace_na(c(1, NA, 3), 2)
## [1] 1 2 3

stringr - consistent wrappers for common string operations

stringr is a replacement for base R collection of string manipulation functions (e.g. grep(), gsub(), etc., see ?grep). In my opinion, stringr API is more user-friendly. All its functions start with str_ prefix and are vectorised over input string and pattern.

Useful links

  • Cheatsheet
  • RegExplain - RStudio addin providing a friendly interface for working with regular expressions and functions from stringr.
  • Regex101 - cool web application for regex testing.

Because stringr homepage offers a nice collection of examples, I will just copy-paste them here and possibly add more information.

library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
## [1] 3 5 5 5 4 9

str_c(..., sep = "", collapse = NULL) concatenates a vector of strings.

str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"

str_sub(string, start = 1L, end = -1L) extracts a substring by a position. Positions can be also negative.

str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
str_sub(x, -2)
## [1] "hy" "eo" "ss" "ra" "al" "ty"

str_detect(string, pattern, negate = FALSE) tells you if there’s any match to the pattern. Handy in dplyr::filter().

str_detect(x, "[aeiou]")
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str_detect(x, "[aeiou]", negate = TRUE)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

str_count(x, pattern) counts the number of patterns.

str_count(x, "[aeiou]")
## [1] 0 3 1 2 2 4

str_subset(string, pattern, negate = FALSE) extracts the matching components. str_which(string, pattern, negate = FALSE) gives indices of the matching components.

str_subset(x, "[aeiou]")
## [1] "video"     "cross"     "extra"     "deal"      "authority"
str_subset(x, "[aeiou]", negate = TRUE)
## [1] "why"
str_which(x, "[aeiou]")
## [1] 2 3 4 5 6

str_locate(string, pattern) gives the position of the match.

str_locate(x, "[aeiou]")
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1

str_extract(string, pattern) extracts the text of the match. Return value = vector.

x
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
str_extract(x, "[aeiou]")
## [1] NA  "i" "o" "e" "e" "a"

str_match(string, pattern) extracts parts of the match defined by parentheses (capturing groups). Return value = matrix, with rows being input vector values, and columns being capturing groups. First column represents the full match.

x
## [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
str_match(x, "(.)[aeiou](.)")
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA  
## [2,] "vid" "v"  "d" 
## [3,] "ros" "r"  "s" 
## [4,] NA    NA   NA  
## [5,] "dea" "d"  "a" 
## [6,] "aut" "a"  "t"

str_replace(string, pattern, replacement) replaces the matches with new text. replacement also supports references of the form \1, \2, etc., which will be replaced with the contents of the respective matched group (created by ()). str_remove(string, pattern) is an alias for str_replace(string, pattern, "").

str_replace(x, "[aeiou]", "?")
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
str_replace_all(x, "[aeiou]", "?")
## [1] "why"       "v?d??"     "cr?ss"     "?xtr?"     "d??l"      "??th?r?ty"
str_replace(x, "(a)", "\\1_")
## [1] "why"        "video"      "cross"      "extra_"     "dea_l"     
## [6] "a_uthority"
str_remove(x, "[aeiou]")
## [1] "why"      "vdeo"     "crss"     "xtra"     "dal"      "uthority"
str_remove_all(x, "[aeiou]")
## [1] "why"   "vd"    "crss"  "xtr"   "dl"    "thrty"

str_split(string, pattern, n = Inf, simplify = FALSE) splits up a string into multiple pieces.

str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"

glue - string interpolation

glue offers the same functionality, which you already know from e.g. Bash ("$variable") or Python (f"{variable}").

library(glue)
name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
glue("My name is {name}, my age next year is {age + 1}, my anniversary is {format(anniversary, '%A, %B %d, %Y')}.")
## My name is Fred, my age next year is 51, my anniversary is Saturday, October 12, 1991.

Equivalently, we can break long string to multiple parameters on separate lines and use named arguments to assign temporary variables:

glue(
  "My name is {name},",
  " my age next year is {age + 1},",
  " my anniversary is {format(anniversary, '%A, %B %d, %Y')}.",
  name = "Joe",
  age = 40,
  anniversary = as.Date("2001-10-12")
)

ggplot2 - plotting made easy

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

To create a plot, you start with initialization of a basic ggplot2 object holding data, and then you add a geom, which tells how to visualize the data and how to map variables to aesthetics (color, size, etc.). This will provide you sensible defaults. To further modify the plot, you can add more components with + operator.

Useful links

ggplot2 is a very extensive library, and so you should mainly know its basic principles and search for details in the reference. We will just look at these basic principles.

library(ggplot2)
mtcars_sm
ggplot(mtcars) +
  geom_point(aes(x = cyl, y = mpg))

We have initialized the ggplot2 object with mtcars data and added geom_point(), which represents a visualization of points in coordinate system. In this geom, we mapped x to cyl and y to mpg variable, respectively.

Each geom_*() has available aesthetics to which you can map variables. Let’s add some colors to the previous plot:

ggplot(mtcars) +
  geom_point(aes(x = cyl, y = mpg, color = gear))

Hmmm but why is gear considered as a continuous variable?

head(mtcars$gear)
## [1] 4 4 4 3 3 3
class(mtcars$gear)
## [1] "numeric"

Because it is defined so.

ggplot2 has very sensible defaults, so, for example, in case of colors it uses continuous or discrete scale, based on variable type. Let’s see what happens when we change gear type to factor:

mtcars2 <- dplyr::mutate(mtcars, gear = factor(gear))
head(mtcars2$gear)
## [1] 4 4 4 3 3 3
## Levels: 3 4 5
ggplot(mtcars2) +
  geom_point(aes(x = cyl, y = mpg, color = gear))

Now you can see that ggplot2 has created the discrete scale, where each level of gear variable has its own color.

Because ggplot2 is a part of tidyverse, it also uses the non-standard evaluation, and so this is equivalent to the example above, but without modifying the mtcars object:

ggplot(mtcars2) +
  geom_point(aes(x = cyl, y = mpg, color = factor(gear)))

Aesthetics are inherited and overwritten by last definition. In this way you haven’t to repeat common aesthetics for each geom. This is equivalent to the example above:

ggplot(mtcars2, aes(x = cyl, y = mpg, color = factor(gear))) +
  geom_point()

Let’s introduce more ggplot2 components:

  • We map hp variable to point size in geom_point().
  • We add one more geom, geom_text(), which is similar to geom_point(), but shows text instead of points.
  • scale_color_manual() translates discrete values (factor levels) to colors. In general, scales control the details of how data values are translated to visual properties.
  • labs() adds labels to aesthetics and main plot areas (e.g. title).
  • theme_bw() is a predefined theme with black and white colors.
  • theme() customizes the non-data components of your plots. We can override settings in predefined theme_*() themes.
p <- ggplot(mtcars2, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  geom_text(aes(label = rownames(mtcars)), color = "black") +
  scale_color_manual(values = c("3" = "orchid4", "4" = "darkcyan", "5" = "darkorange1")) +
  labs(
    title = "mtcars", subtitle = "Acceleration vs. Fuel consumption",
    x = "1/4 mile time", y = "Miles per gallon", size = "Horse Power", color = "Gear"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(color = "chocolate4", face = "bold"),
    plot.subtitle = element_text(color = "orangered3"),
    axis.title = element_text(face = "bold")
  )

print(p)

Another useful ability of ggplot2 is facetting. facet_wrap() splits a plot to panels according to discrete variable, which is specified by formula. We split the plot by am variable (transmission: 0 = automatic, 1 = manual) and define a translation of labels for its levels.

mtcars$am
##  [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
p + facet_wrap(~ am, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

We can specify multiple variables and plot is then created for every combination of their levels.

p + facet_wrap(~ am + cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

Similar function is facet_grid() which creates a matrix of plots. Rows and columns are specified by LHS and RHS of formula.

p + facet_grid(am ~ cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

Again, multiple variables are supported, for both LHS and RHS.

p + facet_grid(am + vs ~ cyl, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual")))

Saving is done via ggsave() function. You can use any format R can handle.

ggsave("ggplot2_example.png", p)
ggsave("ggplot2_example.pdf", p)

ggplot2_example.png

ggplot2_example.pdf

Alternatively, you can use base R functions. This is useful for saving multiple plots to a single multipage PDF.

pdf("ggplot2_example.pdf")
print(p)
dev.off()

Libraries extending the ggplot2

ggpubr - publication-ready plots

ggpubr::ggboxplot(
  mtcars, x = "cyl", y = "qsec",
  color = "cyl", shape = "cyl",
  add = "jitter"
)

GGally - useful geoms

GGally::ggpairs(mtcars, columns = c("cyl", "mpg", "qsec"))

cowplot - aligning multiple plots

cowplot can do more than aligning plots. For example, it can draw images.

p1 <- ggplot(mtcars) + 
  geom_point(aes(x = disp, y = mpg))
p2 <- ggplot(mtcars) + 
  geom_point(aes(x = disp, y = hp))

cowplot::plot_grid(p1, p2, align = "h")

ggforce - cool functions

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(gear))) +
  ggforce::facet_zoom(x = gear == 4) +
  theme_bw()

ggcorrplot - visualize a correlation matrix

corr <- round(cor(mtcars), 1)
corr[1:5, 1:5]
##       mpg  cyl disp   hp drat
## mpg   1.0 -0.9 -0.8 -0.8  0.7
## cyl  -0.9  1.0  0.9  0.8 -0.7
## disp -0.8  0.9  1.0  0.8 -0.7
## hp   -0.8  0.8  0.8  1.0 -0.4
## drat  0.7 -0.7 -0.7 -0.4  1.0
p_mat <- ggcorrplot::cor_pmat(mtcars)
p_mat[1:5, 1:5]
##               mpg          cyl         disp           hp         drat
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05
## cyl  6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09 8.244636e-06
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08 5.282022e-06
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 9.988772e-03
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03 0.000000e+00
p_corr <- ggcorrplot::ggcorrplot(corr, method = "circle", p.mat = p_mat)
print(p_corr)

ggrepel - repel overlapping text labels

ggplot(mtcars2, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  ggrepel::geom_text_repel(aes(label = rownames(mtcars)), color = "black")

Additional themes

ggsci - a collection of color palettes inspired by (mainly) scientific journals

Nature Publishing Group color palette:

p1 <- ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  theme_bw() +
  ggsci::scale_color_npg()

p2 <- ggplot(mtcars) +
  geom_histogram(aes(x = qsec, fill = factor(cyl)), colour = "black", binwidth = 1, position = "dodge") +
  theme_bw() +
  ggsci::scale_fill_npg()

cowplot::plot_grid(p1, p2, align = "h")

hrbrthemes - themes and color palettes

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  hrbrthemes::theme_modern_rc()

see - themes and color palettes

ggplot(mtcars) +
  geom_point(aes(x = hp, y = qsec, color = factor(cyl))) +
  see::theme_abyss()

cowplot

Not really a collection of e.g. colors, but useful visual helpers with explanation.

Other useful libraries

janitor - table summaries (RIP table())

Personally, I am mainly using these functions:

tabyl() (vignette) creates a frequency table. It’s a replacement for base R table(), which has a weird return value with horrible API.

janitor::tabyl(mtcars, cyl)
janitor::tabyl(mtcars, cyl, gear)
janitor::tabyl(mtcars, cyl, gear, am)
## $`0`
##  cyl  3 4 5
##    4  1 2 0
##    6  2 2 0
##    8 12 0 0
## 
## $`1`
##  cyl 3 4 5
##    4 0 6 2
##    6 0 2 1
##    8 0 0 2

adorn_*() append row and/or column with summary statistics.

janitor::tabyl(mtcars, cyl, gear) %>%
  janitor::adorn_totals(c("row", "col")) %>%
  janitor::adorn_percentages("row") %>%
  janitor::adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  janitor::adorn_ns() %>%
  as.data.frame()

get_dupes() returns duplicated rows.

janitor::get_dupes(mtcars, mpg, hp)

plotly - interactive HTML plots

plotly is a JavaScript library with wrappers for R and Python. It is somewhat similar to ggplot2 in terms of The Grammar of Graphics. Output is a HTML file or in case of RMarkdown, it can be directly embedded as chunk output.

We want go into details, so if you are interested, see examples. We will try to replicate the mtcars scatterplot done in ggplot2:

p_ggplot2 <- ggplot(mtcars2, aes(x = qsec, y = mpg)) +
  geom_point(aes(color = factor(gear), size = hp)) +
  theme_bw()

print(p_ggplot2)

p_plotly <- plotly::plot_ly(
  data = mtcars,
  x = ~qsec, y = ~mpg, color = ~factor(gear), size = ~hp,
  type = "scatter", mode = "markers",
  text = ~glue("mpg: {mpg}<br>qsec: {qsec}<br>gear: {gear}<br>hp: {hp}")
)

p_plotly

plotly can also convert ggplot2 objects, but it’s not always accurate.

plotly::ggplotly(p_ggplot2)

But this seems to be quite good, even better than p_plotly.

plotly produces objects inheriting from htmlwidget class. We can use htmlwidgets package to save plotly object to HTML file.

class(p_plotly)
## [1] "plotly"     "htmlwidget"
htmlwidgets::saveWidget(p_plotly, "p_plotly.html")

p_plotly.html

heatmaply - interactive HTML heatmaps

heatmaply uses plotly as a backend. Its API is mostly compatible with the classic heatmap.2() function.

heatmaply::heatmaply(mtcars)

It has also functionality similar to ggcorrplot:

cor <- psych::corr.test(mtcars)

heatmaply::heatmaply_cor(
  cor$r,
  node_type = "scatter",
  point_size_mat = -log10(cor$p), 
  point_size_name = "-log10(p-value)",
  label_names = c("x", "y", "Correlation")
)

pheatmap - pretty heatmaps in base R

The ordinary heatmap() function in R has several drawbacks when it comes to producing publication quality heatmaps. It is hard to produce pictures with consistent text, cell and overall sizes and shapes. The function pheatmap() tries to alleviate the problems by offering more fine grained control over heatmap dimensions and appearance.

You can find a bioinformatic use case here.

annotation_col <- mtcars[, c("cyl", "gear")]
head(annotation_col)
hclust_cols <- hclust(dist(t(mtcars)), method = "complete") %>%
  dendextend::cutree(tree = as.dendrogram(.), k = 2)
annotation_row <- data.frame(
  Cluster = if_else(hclust_cols == 1, "Cluster 1", "Cluster 2"),
  row.names = colnames(mtcars)
)
head(annotation_row)
pheatmap::pheatmap(t(mtcars), annotation_col = annotation_col, annotation_row = annotation_row)

Parallelization

BiocParallel package provides easy-to-use functions for parallel execution. We won’t go into details and just introduce probably the most used function bplapply(), a parallelized version of lapply().

As you know, lapply(X, FUN, ...) applies FUN with optional arguments ... to each element of X:

x <- 1:5
fn <- function(i) i^2
lapply(x, fn)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25

bplapply(X, FUN, ..., BPREDO = list(), BPPARAM = bpparam()) does the same, but each element is processed in parallel. Result is guaranteed to have the same order of elements as in X.

BiocParallel::bplapply(x, fn)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25

Defaultly, all available cores are used. To control for resources, use the BPPARAM argument.

(bpparam <- BiocParallel::MulticoreParam(workers = 2))
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
BiocParallel::bplapply(x, fn, BPPARAM = bpparam)

Parallel processing is hard to debug, even with try-catch equivalent function bptry(). Also, you can’t use browser() inside FUN. But sometimes is just enough to run bplapply() in non-parallel (serial) mode. This is possible with SerialParam().

(bpparam <- BiocParallel::SerialParam())
fn <- function(i) {
  browser()
  i^2
}

BiocParallel::bplapply(x, fn, BPPARAM = bpparam)

BiocParallel supports jobs in distributed systems and much more, but for now we can live only with the simple multicore usage.

Cleanup

save.image("intro_to_advanced_R.RData")

warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.2.1    stringr_1.4.0    dplyr_0.8.3      magrittr_1.5    
## [5] glue_1.3.1       knitr_1.24       rmarkdown_1.15.1
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-143        bitops_1.0-6        insight_0.8.0      
##   [4] webshot_0.5.1       RColorBrewer_1.1-2  httr_1.4.1         
##   [7] ggsci_2.9           tools_3.6.2         backports_1.1.4    
##  [10] R6_2.4.0            KernSmooth_2.23-16  lazyeval_0.2.2     
##  [13] colorspace_1.4-1    withr_2.1.2         tidyselect_0.2.5   
##  [16] gridExtra_2.3       GGally_1.4.0        mnormt_1.5-5       
##  [19] compiler_3.6.2      extrafontdb_1.0     Cairo_1.5-10       
##  [22] TSP_1.1-7           see_0.4.0           plotly_4.9.0       
##  [25] labeling_0.3        bayestestR_0.4.0    caTools_1.17.1.2   
##  [28] scales_1.0.0        psych_1.8.12        ggridges_0.5.1     
##  [31] digest_0.6.20       foreign_0.8-74      pkgconfig_2.0.2    
##  [34] htmltools_0.3.6     extrafont_0.17      htmlwidgets_1.3    
##  [37] rlang_0.4.0         rstudioapi_0.10     shiny_1.3.2        
##  [40] farver_1.1.0        jsonlite_1.6        crosstalk_1.0.0    
##  [43] BiocParallel_1.18.1 gtools_3.8.1        dendextend_1.12.0  
##  [46] parameters_0.4.0    Rcpp_1.0.2          munsell_0.5.0      
##  [49] gdtools_0.1.9       viridis_0.5.1       lifecycle_0.1.0    
##  [52] stringi_1.4.3       yaml_2.2.0          snakecase_0.11.0   
##  [55] MASS_7.3-51.5       gplots_3.0.1.1      plyr_1.8.4         
##  [58] grid_3.6.2          hrbrthemes_0.6.0    promises_1.0.1     
##  [61] parallel_3.6.2      gdata_2.18.0        ggrepel_0.8.1      
##  [64] crayon_1.3.4        lattice_0.20-38     cowplot_1.0.0      
##  [67] zeallot_0.1.0       pillar_1.4.2        ggpubr_0.2.2       
##  [70] ggsignif_0.6.0      effectsize_0.0.1    reshape2_1.4.3     
##  [73] codetools_0.2-16    gclus_1.3.2         evaluate_0.14      
##  [76] data.table_1.12.2   httpuv_1.5.1        vctrs_0.2.0        
##  [79] tweenr_1.0.1        foreach_1.4.7       Rttf2pt1_1.3.7     
##  [82] gtable_0.3.0        purrr_0.3.2         polyclip_1.10-0    
##  [85] tidyr_1.0.0         reshape_0.8.8       heatmaply_0.16.0   
##  [88] assertthat_0.2.1    xfun_0.9            ggforce_0.3.1      
##  [91] mime_0.7            janitor_1.2.0.9000  xtable_1.8-4       
##  [94] later_0.8.0         ggcorrplot_0.1.3    viridisLite_0.3.0  
##  [97] seriation_1.2-7     tibble_2.1.3        pheatmap_1.0.12    
## [100] iterators_1.0.12    registry_0.5-1      cluster_2.1.0      
## [103] ellipsis_0.2.0.1

HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell :)

library(rmarkdown)
library(knitr)
library(glue)

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("intro_to_advanced_R.Rmd", output_file = "intro_to_advanced_R.html")